Personal Loan Campaign

Background and Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy a personal loan, to determine which variables are most significant, and to identify which segment of customers should be targeted more.

Data Dictionary

Loading Libraries

Load data

View the first and last 5 rows of the dataset.

Understand the shape of the dataset.

Check the data types of the columns for the dataset.

Summary of the dataset.
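The loading and inspection steps above can be sketched as follows. The frame below is a hypothetical stand-in (column names are illustrative, not the actual file's); in the notebook the data would come from `pd.read_csv` on the provided file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the campaign data; in the notebook this frame
# would come from pd.read_csv on the provided file.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.integers(22, 65, size=100),
    "income": rng.integers(20, 200, size=100),
    "education": rng.choice(["Undergrad", "Graduate", "Advanced"], size=100),
    "personal_loan": rng.choice([0, 1], size=100, p=[0.9, 0.1]),
})

print(df.head())        # first 5 rows
print(df.tail())        # last 5 rows
print(df.shape)         # (rows, columns)
print(df.dtypes)        # column data types
print(df.describe().T)  # summary statistics for numeric columns
```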

Let us look at the different levels in the categorical variables
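A minimal sketch of this check, using a tiny hypothetical frame with the categorical columns examined in the EDA below:

```python
import pandas as pd

# Tiny hypothetical frame standing in for the loaded dataset.
df = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov", "Self-emp"],
    "marital_status": ["Married", "Never-married", "Married", "Divorced"],
    "salary": ["<=50K", ">50K", "<=50K", "<=50K"],
})

# List the distinct levels (and their counts) of every categorical column.
for col in df.select_dtypes(include="object").columns:
    print(f"{col}: {df[col].nunique()} levels")
    print(df[col].value_counts().to_string())
    print()
```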

Observations:


Univariate Analysis

Observations on fnlwgt

Observations on workclass

Observations on marital_status

Observations on native_country

Observations on salary

Bivariate analysis


income vs personal loan

salary vs working_hours_per_week

Summary of EDA

Data Description:

Data Cleaning:

Observations from EDA:

Actions for data pre-processing:

Data Pre-Processing

Dropping capital_gain and capital_loss

Data Preparation

Encoding >50K as 0 and <=50K as 1, as the government wants to identify the underprivileged section of society.
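This encoding can be sketched with a simple mapping (the `salary` values below are illustrative):

```python
import pandas as pd

# Hypothetical salary column; the mapping matches the encoding described above:
# >50K -> 0, <=50K -> 1 (the positive class is the underprivileged group).
salary = pd.Series([">50K", "<=50K", "<=50K", ">50K"])
encoded = salary.map({">50K": 0, "<=50K": 1})
print(encoded.tolist())  # [0, 1, 1, 0]
```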

Creating training and test sets.
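A sketch of the split, assuming hypothetical `X` and `y` (in the notebook they come from the preprocessed dataframe); `stratify=y` keeps the class ratio similar in the train and test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and target; in the notebook X and y come from
# the preprocessed dataframe.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# stratify=y preserves the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```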

Building the model

Model evaluation criterion

The model can make wrong predictions in two ways:

  1. Predicting a person's salary is <=50K when in reality it is >50K.
  2. Predicting a person's salary is >50K when in reality it is <=50K.

Which case is more important?

How do we reduce this loss, i.e., reduce false negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
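One way to sketch such helpers (function names and the demo data are hypothetical, not the notebook's exact code):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def model_performance(model, X, y):
    """Return accuracy, recall, precision, and F1 for a fitted classifier."""
    pred = model.predict(X)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y, pred)],
        "Recall": [recall_score(y, pred)],
        "Precision": [precision_score(y, pred)],
        "F1": [f1_score(y, pred)],
    })

def show_confusion_matrix(model, X, y):
    """Print and return the confusion matrix for a fitted classifier."""
    cm = confusion_matrix(y, model.predict(X))
    print(cm)
    return cm

# Quick demo on synthetic data (for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)
clf = LogisticRegression().fit(X, y)
print(model_performance(clf, X, y))
```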

Logistic Regression

Finding the coefficients

Coefficient interpretations

Converting coefficients to odds
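The conversion itself is a single exponentiation, sketched here on synthetic data (the model and data are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data where feature 0 pushes the odds up and feature 1 down.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
lg = LogisticRegression().fit(X, y)

# np.exp converts log-odds coefficients into odds ratios: a one-unit
# increase in a feature multiplies the odds of the positive class by
# this factor, holding the other features constant.
odds = np.exp(lg.coef_.ravel())
print(odds)
```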

Odds from coefficients

Coefficient interpretations

Interpretation for other attributes can be done similarly.

Checking model performance on training set

Checking model performance on test set

ROC-AUC

Model Performance Improvement

Optimal threshold using AUC-ROC curve
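One common way to pick the threshold from the ROC curve is Youden's J statistic (tpr − fpr), sketched below on synthetic stand-in data (the actual notebook may use a different criterion):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Synthetic stand-in for the trained model and data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=0.7, size=300) > 0).astype(int)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Youden's J statistic (tpr - fpr) picks the threshold that best
# balances true positives against false positives.
fpr, tpr, thresholds = roc_curve(y, proba)
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
print(optimal_threshold)

# Re-classify using the tuned threshold instead of the default 0.5.
pred = (proba >= optimal_threshold).astype(int)
```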

Checking model performance on training set

Checking model performance on test set

Let's use the precision-recall curve and see if we can find a better threshold
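A sketch of one possible choice: the threshold where precision and recall are closest to each other (again on synthetic stand-in data; the notebook's exact criterion may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic stand-in for the trained model and data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=0.7, size=300) > 0).astype(int)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# precision_recall_curve returns one threshold per precision/recall pair
# (except the last pair); pick the threshold where the two metrics cross.
precisions, recalls, thresholds = precision_recall_curve(y, proba)
idx = np.argmin(np.abs(precisions[:-1] - recalls[:-1]))
balanced_threshold = thresholds[idx]
print(balanced_threshold)
```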

Checking model performance on training set

Checking model performance on test set

Model Performance Summary

Conclusion

Recommendations

Sequential Feature Selector (ADDITIONAL)

Note: Kindly do not run the code cells containing the Sequential Feature Selector implementation during the session, since that algorithm takes considerable time to run.

Selecting subset of important features using Sequential Feature Selector method

Why should we do feature selection?

How does the sequential feature selector work?

Finding which features are important
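A minimal sketch using scikit-learn's `SequentialFeatureSelector` on synthetic data (the notebook selects the best 8 features; 3 are used here only to keep the illustration fast, and the data is hypothetical):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first two of ten features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=150) > 0).astype(int)

# Forward selection greedily adds the feature that most improves the
# cross-validated score until the requested count is reached.
sfs = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=3,
    direction="forward", scoring="recall", cv=3,
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```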

Let's look at the best 8 variables

Let's look at model performance

Model Performance Summary

Conclusion

Split Data

Build Decision Tree Model

We will build our model using the DecisionTreeClassifier function, using the default 'gini' criterion to split.

If the frequency of class A is 10% and the frequency of class B is 90%, then class B will become the dominant class and the decision tree will become biased toward it.
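A sketch of such a tree on an imbalanced synthetic target; `class_weight='balanced'` (an assumption here, not necessarily the notebook's setting) is one way to counter the majority-class bias just described:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic target (~10% positives), mirroring the scenario above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 1.3).astype(int)

# criterion='gini' is the default split criterion; class_weight='balanced'
# reweights classes inversely to their frequency.
tree = DecisionTreeClassifier(criterion="gini", class_weight="balanced",
                              random_state=1).fit(X, y)
print(tree.get_depth(), tree.score(X, y))
```

An unconstrained tree like this fits the training data perfectly, which is exactly the overfitting addressed in the next steps.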

Visualizing the Decision Tree

Reducing over fitting

Using GridSearch for Hyperparameter tuning of our tree model
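A sketch of the search, with a hypothetical parameter grid (the notebook's actual search space may differ); recall is used as the scoring metric since false negatives are the costlier error here:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hypothetical grid; tune depth, leaf size, and leaf count together.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 5, 10],
    "max_leaf_nodes": [10, 20, None],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                    scoring="recall", cv=3)
grid.fit(X, y)
print(grid.best_params_)
```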

Confusion Matrix - decision tree with tuned hyperparameters

Observations from the tree:

Using the decision rules extracted above, we can make interpretations from the decision tree model, for example:

Interpretations from other decision rules can be made similarly

Cost Complexity Pruning
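The mechanics can be sketched with scikit-learn's `cost_complexity_pruning_path`, on synthetic stand-in data; the path returns the effective alphas at which subtrees get pruned away (larger alpha means a smaller tree), and one tree is fit per alpha:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Effective alphas at which subtrees are pruned; larger alpha -> smaller tree.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

# Fit one tree per alpha; in the notebook, the alpha giving the best
# recall on the test set would be kept.
trees = [DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y)
         for a in path.ccp_alphas]
print(trees[0].get_depth(), trees[-1].get_depth())
```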

Confusion Matrix - post-pruned decision tree

The decision tree with post-pruning gives the highest recall on the test set.